Review of Bioinformatics and Biometrics (RBB) Volume 2 Issue 2, June 2013 



www.seipub.org/rbb 



Mining Temporal Association Rules from 
Time Series Microarray Using Apriori 
Algorithm 

Bhuvaneswari V. 1 , Umajothy P. 2 
Assistant Professor 1 , M Phil Research Scholar 2 

Department of Computer Applications, School of Computer Science and Engineering, Bharathiar University, 

Coimbatore, Tamil Nadu, India 

bhuvanes_v@yahoo.com; umajo7@gmail.com 



Abstract 

Microarray technology is used to automate diagnostic 
tasks and improve the accuracy of traditional diagnostic 
techniques. It is a high-throughput method for analyzing 
the expression levels of multiple genes simultaneously, 
and the analysis of microarrays presents a number of 
unique challenges for data mining. The concept of 
association plays an important role in finding 
relationships among the expression levels of multiple 
genes. In the proposed work, association rule mining is 
used to mine gene expression data in order to analyze the 
effect of the expression of one gene on other genes for 
the gene functionalities Biological Process and Molecular 
Function. The yeast Saccharomyces cerevisiae dataset is 
used, and the Apriori algorithm is applied to find the 
association of genes under different experimental 
conditions for time intervals 2 and 4. 

Introduction 

Data mining is the extraction of hidden predictive 
information from large databases. It is one step of a 
larger process known as Knowledge Discovery in 
Databases (KDD): the process of discovering 
meaningful new correlations, patterns and trends by 
sifting through large amounts of data stored in 
repositories, using pattern recognition techniques as 
well as statistical and mathematical techniques. 
Bioinformatics is the science of managing, mining and 
interpreting information from biological sequences 
and structures [1]. 

Bioinformatics and data mining offer exciting and 
challenging research problems in several application 
areas, especially in computer science. A microarray is a 
sequence of dots of DNA, protein, or tissue arranged 
on an array for easy simultaneous analysis. The DNA 
microarray plays an integral role in gene expression 
profiling. Alternative names for the DNA microarray are 
gene chip, DNA chip and biochip [3]. A gene is a section 
of DNA at a specific position on a particular chromosome 
that specifies the amino acid sequence for a protein. 

A gene of an organism is composed of many interacting 
genetic elements. It is difficult to discover these 
interacting elements within complex biological 
regulation. The microarray technique allows researchers 
to observe the expression levels of thousands of genes 
in a single experiment. Gene expression analysis in a 
microarray experiment is used to monitor the expression 
levels of genes at a genome scale. Various association 
algorithms are used to relate the expression levels of 
thousands of genes measured simultaneously under a 
particular condition. The microarray consists of gene 
values which are interrelated with one another and, in 
time series experiments, are also subject to time 
constraints. The proposed method handles such time 
series data with association rule mining in order to 
extract the temporal dependency among genes. 

Association rule mining is used to find interesting 
association and correlation relationships among items 
in a large database. An association rule is a pair of 
disjoint item sets; if LHS and RHS denote the two 
disjoint item sets, the rule is written as 
LHS -> RHS. Support and confidence are the two 
parameters of an association rule. 

• Support is used to find all the frequent item 
sets that satisfy the minimum support threshold 
value. For a rule A -> B over the transaction set T, 
it can be represented as 

Support(A -> B) = P(A ∪ B) 

that is, the fraction of transactions in T that 
contain both A and B. 

• Confidence is used to find all the high 
confidence rules among the frequent item sets that 
satisfy the support threshold. The confidence of the 
rule LHS -> RHS over the transaction set T is the 
ratio support(LHS -> RHS) / support(LHS), which 
can be represented as 

Confidence(A -> B) = P(B | A) 
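
The two measures can be illustrated with a minimal Python sketch. The function names, the toy transactions and the gene labels (g1U, g2D, g3U) are assumptions made for illustration only and are not taken from the dataset of this work.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(LHS and RHS together) divided by support(LHS)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # rule is undefined when LHS never occurs; 0.0 is a convention
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical transactions: each item is a gene label with a U/D suffix.
transactions = [
    {"g1U", "g2D", "g3U"},
    {"g1U", "g2D"},
    {"g1U", "g3U"},
    {"g2D", "g3U"},
]

rule_support = support(transactions, {"g1U", "g2D"})
rule_confidence = confidence(transactions, {"g1U"}, {"g2D"})
```

Here the rule g1U -> g2D holds in two of four transactions, and g1U alone occurs in three, so the confidence is the ratio of those two fractions.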

A temporal association rule represents the various 
transcriptional time delays between associated genes. 
It has the form [gene A↑, gene B↓] -> (7 min) [gene C↑], 
in which the arrows mark up-regulation (↑) and 
down-regulation (↓): the given expression states of 
genes A and B are followed by the expression of gene C 
after the stated delay, here 7 minutes [6]. 

Definition 1: A temporal item is an item which has a 
time stamp and a temporal item set is a non-empty set 
of temporal items. A temporal association rule 
expresses that a set of items tends to appear along with 
another set of items in the same transactions, in a 
specific time frame. 

Definition 2: A temporal association rule is a pair of 
disjoint temporal item sets. LHS and RHS denote the 
left and right temporal item sets. The temporal 
association rule is written as LHS -> (Δ) RHS, where Δ 
is the interval between two different time stamps. The 
temporal association rule captures the size of the 
transcriptional time delay between the associated genes 
(in minutes), the activation and inhibition relationship 
[gene A↑ -> gene C↓], and the co-regulation of genes 
(gene A↑, gene B↓). 

In this work, a method is proposed to extract the 
temporal association rules from the yeast microarray 
data using Apriori algorithm and analyze the similar 
pattern for genes in different time series of the 
experimental data. 

The paper is organized as follows. Section 2 describes 
the literature review of various association rule mining 
algorithms used to analyze the microarray data. 
Section 3 describes the proposed methodology for 
mining the temporal association rules using Apriori 
algorithm. Section 4 depicts the experimental results of 
the proposed method. In section 5, conclusion of the 
work is made. 

Related Work 

Microarray data presents new challenges which make 
many traditional data mining techniques infeasible for 
extracting hidden gene relationships. The main 
challenge is its high dimensionality: a large number of 
attributes (columns) and a considerably smaller number 
of expression experiments (rows). To use current data 
mining algorithms, biologists simplify their data by 
restricting the analysis to a small proportion of the 
attributes. Microarray data consists of gene names with 
expression values which are interrelated with one 
another. The association relationships between genes 
can be found by association rule mining, in which 
algorithms such as CLOSET, Apriori, and Partition 
are used to find the frequent item sets and 
association rules. 

Association rule mining (ARM), widely used for 
mining large databases, was originally introduced to 
analyze market basket data for consumer purchasing 
patterns and has since been applied in various 
application areas. The exploration of microarray data 
has led to many proposals for mining association rules. 

In [17], a "Data Mining Ready" data structure called 
the Peano count tree (P-tree) is introduced for 
association rule mining, together with algorithms such 
as Peano-ARM (association rule mining) and P-gen. 
The microarray data is organized into a bit-sequential 
format in which each bit file is converted to a 
quadrant-based P-tree, from which rules are easily 
derived. 

In [1], the problem of mining association rules is 
discussed. The algorithms Apriori and AprioriTid are 
compared, and a new algorithm called AprioriHybrid is 
proposed. The efficiency of the algorithms is tested 
and compared with that of the earlier AIS algorithm 
(named after its authors Agrawal, Imielinski and Swami) 
for mining large databases. In [14], a study of the 
analysis of DNA microarray data using association rules 
is presented. Association rule mining algorithms such 
as Apriori, FP-growth, Partition, Dynamic FP-growth and 
Dynamic Itemset Counting (DIC) are used to analyze and 
associate the gene expression data, and the way each 
algorithm finds the frequent item sets and association 
rules is explained. 

In [15], a method called Association Pattern Discovery 
(APD) is put forward to discover frequent item sets 
and association rules. The co-regulated gene profiles 
in the yeast microarray datasets are discovered by the 
MAP (Mining Attribute Profile) algorithm, which is 
compared with the traditional APD methods and is 
reported to give the best performance. In [16], a new 
FIS (frequent item set) tree mining algorithm is 
introduced. It uses a bit string data partition format 
to derive association relationships among different 
genes by means of two data structures, the BSC tree and 
the FIS tree: the BSC tree acts as a string compression 
tree for the bit string representing each gene, while 
the FIS tree stores all the frequent item sets. 

In [2], incremental mining of association rules is 
introduced using an extended TFP (Total From Partial) 
Apriori tree, which handles the problem of mining 
association rules in incremental databases by building 
TFP trees incrementally. In [6], a temporal 
association rule mining method is proposed to 
extract the temporal dependencies among related 
genes using the Apriori algorithm. The results are 
interpreted using Gene Ontology and KEGG pathways. 

In [8], a new algorithm called T-Apriori is put 
forward which extracts temporal association rules 
with respect to time. The frequent item sets are 
extracted with their support and confidence values. In 
[12], an algorithm called SPFA (Segmented Progressive 
Filter Algorithm) is introduced for large databases, 
generating the temporal frequent item sets and temporal 
sub-item sets. The first part of the algorithm divides 
the database into partitions, and the second part 
filters the 2-item sets. The algorithm reduces the 
execution time by using a scan reduction technique when 
generating the candidate item sets. In [18], a 
regularized neural network model is described for 
characterizing the multiple heterogeneous temporal 
dynamic patterns of gene expression. A feed-forward 
neural network is introduced to model the gene 
expressions. The method is applied to yeast microarray 
data and compared with nearest neighbor, SVM (support 
vector machine) and self-organizing maps, and is 
reported to give the best performance. 

Methodology 

The objective of the proposed work is to extract 
temporal association rules for the yeast microarray 
dataset from different experimental conditions using 
the Apriori algorithm. The framework for the work is 
given in FIG 1. 

The framework consists of three phases. In the first 
phase, the microarray data is preprocessed. In the 
second phase, the genes are classified based on their 
functionality using Gene Ontology (GO). In the third 
phase, the temporal association rules are extracted for 
the intervals 2 and 4, and the similar patterns of rules 
are extracted and compared. 

FIG. 1 EXTRACTION OF TEMPORAL ASSOCIATION RULES 

Preprocessing Microarray Data 

The yeast microarray data is used as the dataset to 
implement the proposed work. The Saccharomyces 
cerevisiae yeast data consists of 6400 genes, including 
empty spots and null values. In the preprocessing step, 
the genes with empty spots and null values are removed 
using the Knimpute method. The resulting dataset has 
6314 genes, which are used to analyze the temporal 
patterns. 
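
The filtering step can be sketched in Python as follows. This is only an illustration of dropping incomplete rows, not the Knimpute method itself; the "EMPTY" spot label, the function name and the toy values are assumptions.

```python
import math


def drop_incomplete_genes(expression):
    """Return a copy of `expression` without empty spots or missing values.

    `expression` maps a gene name to its list of time series values.
    A spot named "EMPTY" (assumed label) or a blank name is treated as an
    empty spot; None or NaN is treated as a null value.
    """
    cleaned = {}
    for gene, values in expression.items():
        name = gene.strip()
        if not name or name.upper() == "EMPTY":
            continue  # empty spot on the array
        if any(v is None or math.isnan(v) for v in values):
            continue  # at least one missing measurement
        cleaned[gene] = values
    return cleaned


# Toy data: gene names follow the yeast ORF style, values are made up.
raw = {
    "YAL051W": [0.165, 0.272, -0.310],
    "YAL054C": [0.120, float("nan"), 0.050],
    "EMPTY": [0.0, 0.0, 0.0],
}
kept = drop_incomplete_genes(raw)
```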

Classifying Genes Based on Functionality 

The yeast microarray data is classified based on 
functionality using Gene Ontology (GO), a controlled 
vocabulary used to describe the function of gene 
products. GO is divided into three functionalities: 
Molecular Function (MF), Biological Process (BP) and 
Cellular Component (CC). Molecular Function describes 
the function of a gene product at the molecular level; 
Biological Process describes the participation of a 
gene in biological activities; Cellular Component 
describes the location of the gene product at the cell 
level. Each gene in the microarray data is mapped to 
its functionality using the SGDann structure. 

SGDann, a master structure of the yeast microarray 
data, contains the parameters SGDaspect, SGDgenes and 
SGDgo. The entire microarray data is mapped against 
this structure (SGDgenes), and the gene expression data 
is partitioned by gene functionality into Biological 
Process and Molecular Function with the corresponding 
time series values. The 1079 gene expression profiles 
extracted for Biological Process and the 1153 for 
Molecular Function are used to find the temporal 
association of genes. 
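
The partitioning by functionality can be sketched in Python as below. The sketch assumes that the annotation is available as two parallel lists (gene names and one-letter GO aspect codes, with "P" for Biological Process, "F" for Molecular Function and "C" for Cellular Component, as in GO annotation files); the function name and the toy annotation are hypothetical and do not reproduce the SGDann structure.

```python
def partition_by_aspect(genes, aspects, expression):
    """Group expression profiles by GO aspect code.

    `genes` and `aspects` are parallel lists taken from the annotation;
    `expression` maps gene name -> time series values. Genes without an
    expression profile are skipped.
    """
    groups = {"P": {}, "F": {}, "C": {}}
    for gene, aspect in zip(genes, aspects):
        if aspect in groups and gene in expression:
            groups[aspect][gene] = expression[gene]
    return groups


# Toy annotation and expression data (made-up values).
annotation_genes = ["YAL051W", "YAL051W", "YAL054C", "YAL999X"]
annotation_aspects = ["P", "F", "F", "P"]
expression = {
    "YAL051W": [0.165, -0.2, 0.3],
    "YAL054C": [0.272, 0.1, -0.4],
}
groups = partition_by_aspect(annotation_genes, annotation_aspects, expression)
```

A gene annotated under more than one aspect appears in each corresponding group, which is why the two partitions in this work can share genes.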

Association of Microarray Genes 

Association rule mining discovers the frequent 
patterns, associations and correlations of item sets 
which are meaningful to the users, and generates 
strong rules on the basis of the frequent patterns. 
Here it is used to identify how the expression of one 
particular gene affects the expression of other genes. 
Finding the gene associations for time intervals 2 and 
4 from the gene expression data for Molecular Function 
and Biological Process consists of two phases: the 
conversion of gene values to discrete values, and the 
extraction of temporal associations among genes. 

(1) Conversion of Gene Values to Discrete Values 

The gene expression values are converted to the 
discrete values up regulated and down regulated in 
order to extract the temporal association rules for the 
time intervals 2 and 4. Up-regulation is the increase in 
expression of a gene, in which the transcription of a 
specific mRNA is increased; it is denoted by the 
symbol "U". Down-regulation is the corresponding 
decrease in expression of a gene, in which the 
transcription of a specific mRNA is reduced; it is 
denoted by the symbol "D". The original gene 
expression matrix is converted to discrete 
up-regulation and down-regulation values using Eq (1) 
and Eq (2); the values are shown in Table 1 and Table 2. 



If (Gene value > 0) then "U"    Eq (1) 

If (Gene value < 0) then "D"    Eq (2) 

TABLE 2 DISCRETE VALUES 
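
Eq (1) and Eq (2) can be written as a short Python sketch. The two equations do not say what happens for a value of exactly zero; returning None in that case is an assumption of this sketch, as are the function names and the sample values.

```python
def discretize(value):
    """Map one expression value to "U" (up) or "D" (down)."""
    if value > 0:
        return "U"  # Eq (1)
    if value < 0:
        return "D"  # Eq (2)
    return None  # value == 0 is not covered by Eq (1) or Eq (2); assumption


def discretize_series(values):
    """Discretize a whole time series of expression values."""
    return [discretize(v) for v in values]


states = discretize_series([0.165, -0.272, 0.0, 1.2])
```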

(2) Extracting Temporal Associations 

The discretized data is used as input to extract the 
temporal association of genes for time intervals 2 and 
4 for Molecular Function and Biological Process. 

The time series of each gene consists of the time 
points T0, T1, T2, T3, T4, T5, and T6. The time stamps 
of the transaction sets for the Molecular Function and 
Biological Process for the time interval Δ=2 are set 
as T0+T2, T1+T3, T2+T4, T3+T5, T4+T6, and for the time 
interval Δ=4 as T0+T1+T2+T3, T1+T2+T3+T4, 
T2+T3+T4+T5, T3+T4+T5+T6. The Apriori association 
rule mining algorithm is used to extract the 
temporal associations for the time stamps in intervals 
2 and 4. The gene expression data is analyzed for 
various support and confidence measures. The 
rules are extracted and the association of genes is 
analyzed to identify similar co-occurrences of 
genes in Molecular Function and Biological Process 
for time intervals 2 and 4. The pseudo code of the 
algorithm is given under the Pseudo Code heading below. 
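
The two sets of time stamps listed above can be generated with a small Python sketch. The function names and the item format "GENE(states)" are assumptions chosen to resemble the labels used later in the discussion; the sample state string is made up.

```python
def windows_interval_2(n_points=7):
    """Pairs of time points two steps apart: (T0, T2), (T1, T3), ..."""
    return [(i, i + 2) for i in range(n_points - 2)]


def windows_interval_4(n_points=7):
    """Runs of four consecutive time points: (T0, T1, T2, T3), ..."""
    return [tuple(range(i, i + 4)) for i in range(n_points - 3)]


def label(window):
    """Render a window as it is written in the text, e.g. "T0+T2"."""
    return "+".join(f"T{i}" for i in window)


def gene_items(gene, states, windows):
    """One item per window: gene name plus its U/D states in that window."""
    return [f"{gene}({''.join(states[i] for i in w)})" for w in windows]


labels_2 = [label(w) for w in windows_interval_2()]
labels_4 = [label(w) for w in windows_interval_4()]
items = gene_items("YAR062W", list("DUUUDDU"), windows_interval_4())
```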

Pseudo Code 



Ck = candidate item sets of size k 
Lk = frequent item sets of size k 
L1 = {frequent items}; 
For (k = 1; Lk != ∅; k++) 
    Ck+1 = candidate item sets generated from Lk; 
    For all transactions t, do 
        Increment the count of every candidate in Ck+1 contained in t; 
    Lk+1 = candidates in Ck+1 with min_support 
End 
Return ∪k Lk; 
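
The pseudo code can be turned into a compact Python sketch of the textbook Apriori procedure. It is not the implementation used in this work: the function name, the use of an absolute transaction count for min_support and the toy transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: count} for all item sets found in at least
    `min_support` transactions (absolute count, an assumption here)."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}

    frequent = dict(current)
    k = 1
    while current:
        # Join step: merge frequent k-item sets into (k+1)-item candidates.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-subset of a candidate must be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k)):
                candidates.add(union)
        # Count step: scan the transactions once per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions with hypothetical discretized gene labels.
toy = [
    {"g1U", "g2D", "g3U"},
    {"g1U", "g2D"},
    {"g1U", "g3U"},
    {"g2D", "g3U"},
]
frequent_sets = apriori(toy, min_support=2)
```

In this toy run every single item and every pair is frequent, while the three-item set occurs in only one transaction and is discarded, which ends the loop.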

The transaction sets are extracted with each gene name 
and its corresponding discrete values. The values 
consist of up and down regulated states at 
different time intervals. The first gene in Table 2 
is taken and associated with the subsequent 
genes in the same interval. 

TABLE 3 TEMPORAL ASSOCIATION OF GENES FOR TIME INTERVAL 2 

Likewise, in each transaction the gene name with 
its values in the intervals, i.e., T1+T3, T2+T4 ... T4+T6, 
is checked. A relation which occurs in the 
intervals for support 50, 80 and 100 is taken as a 
frequent item set for the time intervals 2 and 4. 
Table 3 and Table 4 show a sample of 3 
genes with their associations for the intervals Δ=2 and 
Δ=4 for Molecular Function and Biological Process. 

TABLE 4 TEMPORAL ASSOCIATION OF GENES FOR TIME INTERVAL 4 

The proposed method extracts the temporal 
association rules from the temporal patterns for the 
Molecular Function and Biological Process for the 
time intervals 2 and 4, which are explained in the 
experimental results. 

Experimental Results 

Dataset 

The results of the proposed work are analyzed for 
temporal association among genes in the Molecular 
Function and Biological Process partitions of the 
dataset and then compared to find similar patterns of 
rules. 

The yeast Saccharomyces cerevisiae dataset is 
downloaded from the National Center for Biotechnology 
Information Gene Expression Omnibus (NCBI GEO) 
website for the proposed work. The yeast microarray 
data contains about 6400 genes (e.g. YAL051W, 
YAL054C) with their corresponding expression values 
(e.g. 0.1650, 0.2720). The file "yeastgenes.sgd" is 
obtained from the GO annotation site. The number of 
genes in the dataset is shown in Table 5 below. 



TABLE 5 DATASET 

The yeast microarray dataset contains 6400 genes. After 
removal of empty genes from the dataset, about 6314 
genes remain, as shown in FIG 2. The entire 
microarray dataset is mapped based on 
functionality. After mapping, 1079 genes are obtained 
for Biological Process and 1153 for Molecular Function, 
as shown in FIG 3. 




FIG. 2 INSTANCES OF PREPROCESSING 



Temporal Associations 

The temporal associations of genes for the 1153 genes 
of Molecular Function and the 1079 genes of Biological 
Process are extracted using the Apriori algorithm for 
the time stamps Δ=2 and Δ=4. The association rule 
mining algorithm is applied to the Biological Process 
and Molecular Function genes for the support measures 
50, 80 and 100. For intervals 2 and 4 respectively, 
239 and 236 temporal association rules are extracted 
for support 50, 281 and 355 rules for support 80, and 
261 and 329 rules for support 100, as shown in 
FIG 3. 

FIG. 3 TEMPORAL RULES OF MOLECULAR FUNCTION FOR Δ=2, Δ=4 

TABLE 6 TEMPORAL ASSOCIATION RULES FOR SUPPORT MEASURES 

The sets of rules extracted by the proposed work for 
Molecular Function and Biological Process 
for the time intervals 2 and 4 for the co-occurrence of 
genes are discussed in section 5. 

Discussion 

The temporal association rules extracted for the 
Molecular Function for time interval 2 with 
support 50, 80 and 100 for the co-occurrence of genes 
are presented. TABLE 6 provides a snapshot of the sets 
of co-occurring genes. In the rules, each gene name 
carries a suffix U or D for up or down regulation. 
The gene YAL018CD, which is down regulated, is 
associated with the genes YAL027WD, YAL037WD, 
YAL065CU, YAR009CD, YAR023CU, YAR061CD, YAR062WU 
and YAR064WD, which are up and down regulated, for 
support 50%. The same set of genes is found for 
support 80%. The co-occurrence patterns are not 
found for support 100%. Likewise, the gene YAL027WD 
with down regulation is associated with the genes 
YAR010CD, YAR010CU, YBL005WD and 
YBL005WU, which are up and down regulated. For 
support 80%, the gene YAL027WD with down 
regulation is also associated with the genes 
YAR010CD, YAR010CU and YBL005WD with up 
and down regulation. This set of co-occurrences is not 
found for 100% support. Likewise, the gene 
YAL037WD with down regulation is associated with 
the genes YAL065CD, YAR009CU, YAR010CD and 
YAR064WU with up and down regulation. This 
association rule is found for support 80% but 
not for 100%. 

TABLE 7 TEMPORAL RULES OF INTERVAL 2 OF MOLECULAR FUNCTION 

The temporal association rules extracted for the 
Molecular Function for time interval 4 with 
support 50, 80 and 100 for the co-occurrence of genes 
are presented. TABLE 8 provides a snapshot of the sets 
of co-occurring genes. The gene YAR029WD, which 
is down regulated, is associated with the genes 
YAR062W(DUUU), YBL095W(DUDU) and 
YBR028C(DUUU) with up and down regulation for 
support 50%. The same set of genes is 
found for support 80% and 100%. 
Similarly, in the second set for support 50%, the 
gene YBR138CD, which is down regulated, is associated 
with the genes YBR285W(DUUD) and YDL176W(DDUU) 
with up and down regulation. The same set of genes is 
found for support 80% and 100%. 

TABLE 8 TEMPORAL RULES OF INTERVAL 4 OF MOLECULAR FUNCTION 



The data set consists of 1153 genes of Molecular 
Function and 1079 genes of Biological Process. It is seen 
that only 1132 of the Molecular Function genes are 
associated with other genes for the corresponding 
support measures, and 21 genes are not found to 
co-occur with other genes: YGL057C, YGL081W, 
YPL030W, YCR051W, YDR520C, YDR222W, YDR266C, 
YJL149W, YPL183C, YLL017W, YFR006W, YPR152C, 
YIL177C, YJL149W, YNR021W, YHR177W, YKR011C, 
YDR287W, YKL033W, YDR415C, YDR539W. 

On the other hand, of the 1079 Biological Process genes, 
1061 genes co-occur with other genes for the 
corresponding support measures, and the remaining 18 
genes are not found to co-occur with other genes: 
YGL057C, YGL081W, YPL030W, YDR266C, YJL149W, 
YPL183C, YLL017W, YFR006W, YPR152C, YIL177C, 
YJL149W, YNR021W, YHR177W, YKR011C, YDR287W, 
YKL033W, YDR415C, YDR539W. 

The temporal association rules extracted with the 
Apriori algorithm for the Molecular Function and 
Biological Process genes for time intervals 2 and 4 
are then compared, and the similar patterns of rules 
shared by the two functionalities are analyzed for 
each interval. 

The similar patterns of rules extracted for interval 2 
number 153 rules for support 50%, 154 rules for support 
80% and 142 rules for support 100%. Likewise, the rules 
extracted for interval 4 number 53 for support 50%, 
132 for support 80% and 163 for support 100%. The 
similar patterns of association rules for the 1153 genes 
of Molecular Function and the 1079 genes of Biological 
Process are shown in FIG 4. 
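
One way to compare two rule sets is a plain set intersection, sketched below in Python. The text does not define precisely when two rules count as similar, so this sketch assumes that a rule is similar only if it appears identically in both sets; the function name and the toy rules are hypothetical.

```python
def similar_rules(rules_a, rules_b):
    """Rules present in both collections.

    Each rule is a (lhs, rhs) pair of frozensets of discretized gene items.
    Treating "similar" as exact equality is an assumption of this sketch.
    """
    return set(rules_a) & set(rules_b)


def rule(lhs, rhs):
    """Helper to build a hashable rule from two iterables of items."""
    return (frozenset(lhs), frozenset(rhs))


# Toy rule sets standing in for Molecular Function and Biological Process.
mf_rules = [rule(["g1D"], ["g2U"]), rule(["g1D"], ["g3D"]), rule(["g4U"], ["g5U"])]
bp_rules = [rule(["g1D"], ["g2U"]), rule(["g4U"], ["g5U"]), rule(["g6D"], ["g7U"])]

shared = similar_rules(mf_rules, bp_rules)
shared_fraction = len(shared) / len(set(mf_rules))
```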



FIG. 4 SIMILAR PATTERN OF RULES OF TIME INTERVAL 2 

The similar patterns of rules extracted for the Molecular 
Function and Biological Process for time intervals 2 
and 4 are shown above. For time interval 2, the number 
of rules increases slightly from 153 for support 50% to 
154 for support 80%, and then decreases to 142 for 
support 100%. For time interval 4, the similar pattern 
of rules starts with 53 rules for support 50%, rises 
sharply to 132 rules for support 80%, and reaches 163 
rules for support 100%. 

Conclusion 

Microarray technology is a tool which helps to 
analyze the expression levels of thousands of genes 
simultaneously. To analyze gene expression data, 
association rule mining is used to discover frequent 
patterns and correlations among the microarray genes. 

The proposed method uses the yeast Saccharomyces 
cerevisiae dataset to find associations using the 
Apriori algorithm under different experimental 
conditions for time intervals 2 and 4. The microarray 
dataset is partitioned into two functionalities, 
Molecular Function with 1153 genes and Biological 
Process with 1079 genes. The dataset is processed to 
extract the temporal association rules for time 
intervals 2 and 4 for the support measures. For time 
interval 2, 239 rules are extracted for support 50%, 
281 rules for 80% and 261 rules for 100%. Likewise, for 
time interval 4, 236 rules are extracted for support 
50%, 360 rules for 80% and 329 rules for 100%. A slight 
difference occurs at support 80% in interval 4 
between Biological Process and Molecular Function. When 
the extracted temporal association rules are compared 
for the Molecular Function and Biological Process, it is 
found that 80% of the temporal rules are the same. 

The extracted temporal association rules for the 
Molecular Function and Biological Process are 
analyzed to extract the similar patterns of rules. 
For time interval 2, the similar patterns extracted for 
support 50, 80 and 100 number 153, 154 and 142 rules, a 
slight rise followed by a decrease. For time interval 4, 
the number of rules increases as 53, 132 and 163. 

The similar pattern of rules thus appears to decrease 
for time interval 2 and to increase for interval 4. 
The similar patterns of rules extracted for the support 
measures are further analyzed to find the 
non-associated genes. It is found that 21 genes in the 
Molecular Function and 18 genes in the Biological 
Process are not associated with other genes. The 
non-associated genes for the Molecular Function and 
Biological Process are YGL057C, YGL081W, YPL030W, 
YDR266C, YJL149W, YPL183C, YLL017W, YFR006W, 
YPR152C, YIL177C, YJL149W, YNR021W, YHR177W, 
YKR011C, YDR287W, YKL033W, YDR415C, YDR539W. 
When the temporal rules of the proposed work are 
analyzed, it is found that 80% of the similar patterns of 
temporal association rules of the Molecular Function 
and the Biological Process for the intervals 2 and 4 are 
the same. Biological validation is left for future 
work. 